Starting in the mid-1930s, the Home Owners' Loan Corporation (HOLC) began appraising building conditions in cities across the country. This information was used to generate maps that marked the level of risk for residential mortgage lenders. Each region was given a letter grade and a color. The highest grade, "A", was colored green on the maps. The lowest grade, "D", representing the highest risk, was colored red. The problem is that these rankings were heavily influenced by racial discrimination. The supporting documents that came with the maps describe the "D"-ranked areas as follows:
"The fourth grade or D areas represent those neighborhoods in which the things that are now taking place in the C neighborhoods, have already happened. They are characterized by detrimental influences in a pronounced degree, undesirable population or an infiltration of it. Low percentage of home ownership, very poor maintenance and often vandalism prevail. Unstable incomes of the people and difficult collections are usually prevalent. The areas are broader than the so-called slum districts. Some mortgage lenders may refuse to make loans in these neighborhoods and others will lend only on a conservative basis."
— The Baltimore "Redlining" Map: Ranking Neighborhoods
The term "redlining" was coined to describe the practice by banks and other institutions of not investing in areas that were given a grade of "D", i.e., colored red on the HOLC maps. Since minorities were specifically targeted by the appraisers, it was minority communities that disproportionately suffered from redlining. This form of economic racial discrimination is one of the main arguments for reparations. However, to begin the discussion of reparations, we first need to understand the current, ongoing economic effects of redlining.
For more information regarding redlining, see the National Community Reinvestment Coalition report on its continuing effects:
HOLC āredliningā maps: The persistent structure of segregation and economic inequality
For a summary, see the Washington Post article about the report
Redlining was banned 50 years ago. It's still hurting minorities today
Digitizations of the redlining maps are available at
Mapping Inequality
And for Baltimore specific information, see
The Baltimore āRedliningā Map: Ranking Neighborhoods
The purpose of this project is to analyze the current effects of redlining on Baltimore City's housing market. Toward this end, I analyzed four measures of the housing market: the percent of lots that are vacant, the percent of units foreclosed, the percent of units sold, and the median sales price. To determine the effects of redlining, I tested how well the redlining map predicts each of these statistics.
The analysis was done using Python and a few core packages: Pandas, GeoPandas, Folium, Matplotlib, and StatsModels. (Each name is a link to the package's documentation.) The data manipulation was done with Pandas and then exported to GeoPandas for easier integration with Folium, the package used to make the interactive maps. Matplotlib was used to make the histograms of the data, and the statistical analysis was done with StatsModels.
import pandas as pd
import geopandas as gpd
import folium
import shapely
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
import math
import matplotlib.pyplot as plt
The redlining data comes from Mapping Inequality as a digitized map in GeoJSON, a standardized format for storing geographic data.
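As a sketch of what that format looks like (with made-up coordinates), a minimal GeoJSON feature is just nested JSON holding a geometry and its properties:

```python
import json

# A minimal, hypothetical GeoJSON feature: one polygon with one property.
feature = json.loads("""
{
  "type": "Feature",
  "properties": {"holc_grade": "D"},
  "geometry": {
    "type": "Polygon",
    "coordinates": [[[-76.62, 39.28], [-76.60, 39.28], [-76.60, 39.30], [-76.62, 39.28]]]
  }
}
""")

print(feature["properties"]["holc_grade"])  # D
print(feature["geometry"]["type"])          # Polygon
```

GeoPandas parses files of such features directly into a DataFrame with a `geometry` column.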
The modern data comes from the Open Baltimore site, a collection of publicly available data about Baltimore. This data is organized by Census tracts and block groups. (For more background, see link.) The geometry of each block group was obtained from the Census Bureau and converted into GeoJSON online using MyGeodata Cloud.
redline_data = gpd.read_file("MDBaltimore1937.geojson")
data = gpd.read_file("mygeodata/tl_2010_24510_bg10.geojson")
housing_data = pd.read_csv("2011_Housing_Market_Typology.csv")
The first step in the data manipulation process was adding a column with the hex color code corresponding to each redlining grade.
def redlining_colormap(holc_grade):
    if holc_grade == 'A':
        return "#00ff00" # Green
    elif holc_grade == 'B':
        return "#0000ff" # Blue
    elif holc_grade == 'C':
        return "#ffff00" # Yellow
    elif holc_grade == 'D':
        return "#ff0000" # Red
    else:
        return "#ffffff"
# Applies the above function to each element of redline_data['holc_grade']
redline_data['color'] = redline_data['holc_grade'].apply(redlining_colormap)
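The same grade-to-color table could equivalently be written as a dictionary lookup with a default, which keeps the mapping in one place; a small sketch:

```python
# Hex colors keyed by HOLC grade; unknown grades fall back to white.
HOLC_COLORS = {'A': '#00ff00', 'B': '#0000ff', 'C': '#ffff00', 'D': '#ff0000'}

def redlining_colormap_dict(holc_grade):
    return HOLC_COLORS.get(holc_grade, '#ffffff')

print(redlining_colormap_dict('D'))  # #ff0000
print(redlining_colormap_dict('E'))  # #ffffff
```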
To judge the effects of redlining, the overlap between each redlining area and each Census block group was computed and recorded as the percentage of the block group covered by each HOLC grade. The area of each overlap was calculated using Shapely.
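As a toy illustration of the Shapely operations used below (with hypothetical coordinates), two offset unit squares share a quarter of their area:

```python
from shapely.geometry import box

# Two hypothetical 1x1 squares offset by 0.5 in each direction.
block_group = box(0.0, 0.0, 1.0, 1.0)
holc_region = box(0.5, 0.5, 1.5, 1.5)

if block_group.intersects(holc_region):
    overlap = block_group.intersection(holc_region).area
    # fraction of the block group covered by the HOLC region
    print(overlap / block_group.area)  # 0.25
```

The real loop does exactly this, but for every (block group, HOLC region) pair.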
perc_green = []
perc_blue = []
perc_yellow = []
perc_red = []
# iterate through each row of the Census block group data
for index1, c_row in data.iterrows():
    c_district = c_row['geometry']  # the polygon representing the current Census block group
    c_area = c_district.area  # the area of the polygon
    is_green = False
    is_blue = False
    is_yellow = False
    is_red = False
    # iterate through each row of the redlining map data searching for overlaps
    for index2, r_row in redline_data.iterrows():
        r_area = r_row['geometry']
        # check if the block group and the redlining region overlap
        if c_district.intersects(r_area):
            # if they do overlap, get the area
            overlap_area = c_district.intersection(r_area).area
            # the overlap can include lines, which have an area of 0, so ignore them
            if overlap_area > 0:
                if r_row['holc_grade'] == 'A':
                    if is_green:
                        perc_green[index1] += overlap_area/c_area
                    else:
                        is_green = True
                        perc_green.append(overlap_area/c_area)
                elif r_row['holc_grade'] == 'B':
                    if is_blue:
                        perc_blue[index1] += overlap_area/c_area
                    else:
                        is_blue = True
                        perc_blue.append(overlap_area/c_area)
                elif r_row['holc_grade'] == 'C':
                    if is_yellow:
                        perc_yellow[index1] += overlap_area/c_area
                    else:
                        is_yellow = True
                        perc_yellow.append(overlap_area/c_area)
                elif r_row['holc_grade'] == 'D':
                    if is_red:
                        perc_red[index1] += overlap_area/c_area
                    else:
                        is_red = True
                        perc_red.append(overlap_area/c_area)
    # add a 0 to the list of percent overlap of each color if there were no overlaps of that color
    if not is_green:
        perc_green.append(0.0)
    if not is_blue:
        perc_blue.append(0.0)
    if not is_yellow:
        perc_yellow.append(0.0)
    if not is_red:
        perc_red.append(0.0)
data['perc_green'] = perc_green
data['perc_blue'] = perc_blue
data['perc_yellow'] = perc_yellow
data['perc_red'] = perc_red
The GeoID in the Baltimore data does not come with leading zeros, so those are added here.
housing_data['blockGroup'] = housing_data['blockGroup'].apply(lambda x: str(x) if (len(str(x)) == 7) else ('0'+str(x)))
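The same left-padding can also be done with `str.zfill`, which pads a string with zeros to a fixed width; a sketch on dummy IDs:

```python
import pandas as pd

# Hypothetical block-group IDs, one missing its leading zero.
ids = pd.Series([1234567, 234567])
padded = ids.astype(str).str.zfill(7)
print(padded.tolist())  # ['1234567', '0234567']
```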
30 of the 653 Census block groups in Baltimore City do not have any associated data in the Open Baltimore database used here. Those areas are recorded here and saved to be highlighted on the map below.
missing = []
# iterate through the Census block group data
for index, d_row in data.iterrows():
    found = False
    # iterate through the Open Baltimore housing data
    for index2, h_row in housing_data.iterrows():
        if ('24510' + h_row['blockGroup']) == d_row['GEOID10']:
            found = True
            break
    if not found:
        missing.append(d_row['geometry'])
Here the combined data (built in the merge step further down) is converted into GeoPandas GeoDataFrames for ease of integration with Folium.
# Create a copy of the com_data DataFrame to preserve its values
temp = com_data.copy()
# Convert to a GeoDataFrame
census_geo = gpd.GeoDataFrame(temp, geometry=temp['geometry'])
# The CRS tells Folium what projection the geometry's units are in
census_geo.crs = "EPSG:4269"
# same as above, for the missing block groups
missing_df = pd.DataFrame(columns=['geometry'], data=missing)
missing_gdf = gpd.GeoDataFrame(missing_df, geometry=missing_df['geometry'])
missing_gdf.crs = "EPSG:4269"
Below is a map of Baltimore City overlaid with the HOLC redlining map and the Census block groups. The bright blue areas are the block groups missing in the Baltimore data. Each layer can be turned on and off using the layer control menu in the top right of the map.
# make the map and set the starting location
map_c = folium.Map(location=[39.29, -76.61], zoom_start=11)
# Add the redlining map layer
folium.GeoJson(redline_data, name='Redlining Map', style_function=lambda feature: {
    'fillColor': feature['properties']['color'],
    'color': feature['properties']['color'],
    'weight': 1,
    'fillOpacity': 0.5,
}).add_to(map_c)
# Add the matched Census block groups
folium.GeoJson(census_geo, name='Census Districts', style_function=lambda feature: {
    'fillColor': "#333333",
    'color': "#000000",
    'weight': 0.5,
    'fillOpacity': 0.1
}).add_to(map_c)
# Add the missing Census block groups
folium.GeoJson(missing_gdf, name='Missing Census Districts', style_function=lambda feature: {
    'fillColor': "#00ffff",
    'color': "#000000",
    'weight': 0.5,
    'fillOpacity': 0.5
}).add_to(map_c)
# Add the layer control menu
folium.LayerControl().add_to(map_c)
map_c  # show the map
Here the map data generated above, recording the percent coverage of each HOLC grade, is merged with the Baltimore housing data. This is done using the GeoID, a Census Bureau identifier that, for a block group, is 12 digits long. (For more information, see this link)
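As a sketch of how a block-group GeoID decomposes (using a hypothetical Baltimore City ID), the digits concatenate state, county, tract, and block group codes:

```python
# Hypothetical 12-digit block-group GeoID: state 24 (MD), county 510
# (Baltimore City), tract 001001, block group 1.
geoid = "245100010011"
state, county, tract, block_group = geoid[:2], geoid[2:5], geoid[5:11], geoid[11:]
print(state, county, tract, block_group)  # 24 510 001001 1
```

This is why the code prepends '24510' to the 7-character Open Baltimore `blockGroup` field before comparing against `GEOID10`.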
The categories are as follows: the ratio of vacant lots, the ratio of foreclosures, the ratio of sales (adjusted for the number of units), and the median sales price.
An interesting note: one Census block group had a foreclosure percentage of 450. This caused major problems with the statistical analysis and appeared erroneous, so I capped that value at 100.
Due to the skewness of the data, the log of each category was also recorded. For more information regarding log transformations in linear models, see this link
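As a toy example (with made-up prices) of why the log transform helps: it compresses the long right tail, so the mean is no longer dragged far above the median:

```python
import numpy as np

# Hypothetical right-skewed sale prices: many modest values, one very large.
prices = np.array([50_000, 60_000, 75_000, 90_000, 120_000, 800_000])
logged = np.log(prices)

# Raw data: the mean sits far above the median because of the outlier.
print(prices.mean() > np.median(prices))  # True
# After the transform, the gap between mean and median shrinks dramatically
# relative to the scale of the data.
print(logged.mean() - np.median(logged))
```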
# create a new DataFrame which will hold the combined data
com_data = pd.DataFrame(columns=['GEOID10', 'geometry', 'area', 'perc_green', 'perc_blue', 'perc_yellow', 'perc_red', 'ratio_vacant', 'ratio_foreclosed', 'ratio_sales', 'median_sale_price', 'log_ratio_vacant', 'log_ratio_foreclosed', 'log_ratio_sales', 'log_median_sale_price'])
for index, d_row in data.iterrows():
    # the total area of the district in square miles (ALAND10 is in square meters)
    area = d_row['ALAND10']/2590000.0
    # data that can be copied directly from the Census data
    new_row = {'GEOID10': d_row['GEOID10'],
               'geometry': d_row['geometry'],
               'area': area,
               'perc_green': d_row['perc_green'],
               'perc_blue': d_row['perc_blue'],
               'perc_yellow': d_row['perc_yellow'],
               'perc_red': d_row['perc_red']}
    # find the corresponding entry in the Open Baltimore data
    for index2, h_row in housing_data.iterrows():
        if ('24510' + h_row['blockGroup']) == d_row['GEOID10']:
            new_row['ratio_vacant'] = h_row['vacantLots']
            if new_row['ratio_vacant'] == 0:
                new_row['log_ratio_vacant'] = 0
            else:
                new_row['log_ratio_vacant'] = math.log(new_row['ratio_vacant'])
            if h_row['foreclosureFilings'] > 100:  # cap the erroneous foreclosure percentage of 450
                new_row['ratio_foreclosed'] = 100
            else:
                new_row['ratio_foreclosed'] = h_row['foreclosureFilings']
            if new_row['ratio_foreclosed'] == 0:
                new_row['log_ratio_foreclosed'] = 0
            else:
                new_row['log_ratio_foreclosed'] = math.log(new_row['ratio_foreclosed'])
            if h_row['unitsPerSquareMile'] == 0:
                new_row['ratio_sales'] = 0
            else:
                # adjust the sales to control for the number of units
                new_row['ratio_sales'] = h_row['sales20092010']/h_row['unitsPerSquareMile']*area
            if new_row['ratio_sales'] == 0:
                new_row['log_ratio_sales'] = 0
            else:
                new_row['log_ratio_sales'] = math.log(new_row['ratio_sales'])
            new_row['median_sale_price'] = h_row['medianSalesPrice20092010']
            if new_row['median_sale_price'] == 0:
                new_row['log_median_sale_price'] = 0
            else:
                new_row['log_median_sale_price'] = math.log(new_row['median_sale_price'])
            break
    com_data = com_data.append(pd.Series(new_row), ignore_index=True)
com_data
com_data = com_data.dropna()
Below is another map with layers showing each of the raw categories listed above. The legend for each layer is shown at the top of the map.
map_d = folium.Map(location=[39.29, -76.61], zoom_start=11)
folium.Choropleth(
    geo_data=census_geo[['GEOID10', 'geometry']],
    name='Ratio of Vacancies',
    data=census_geo,
    columns=['GEOID10', 'ratio_vacant'],
    key_on='feature.properties.GEOID10',
    fill_color='Blues',
    fill_opacity=0.5,
    line_opacity=0.7,
    legend_name='Ratio of Vacancies',
    show=False).add_to(map_d)
folium.Choropleth(
    geo_data=census_geo[['GEOID10', 'geometry']],
    name='Ratio of Foreclosures',
    data=census_geo,
    columns=['GEOID10', 'ratio_foreclosed'],
    key_on='feature.properties.GEOID10',
    fill_color='OrRd',
    fill_opacity=0.7,
    line_opacity=0.7,
    legend_name='Ratio of Foreclosures',
    show=False).add_to(map_d)
folium.Choropleth(
    geo_data=census_geo[['GEOID10', 'geometry']],
    name='Ratio of Sales',
    data=census_geo,
    columns=['GEOID10', 'ratio_sales'],
    key_on='feature.properties.GEOID10',
    fill_color='Purples',
    fill_opacity=0.7,
    line_opacity=0.7,
    legend_name='Ratio of Sales',
    show=False).add_to(map_d)
folium.Choropleth(
    geo_data=census_geo[['GEOID10', 'geometry']],
    name='Median Sales Price',
    data=census_geo,
    columns=['GEOID10', 'median_sale_price'],
    key_on='feature.properties.GEOID10',
    fill_color='BuGn',
    fill_opacity=0.5,
    line_opacity=0.7,
    legend_name='Median Sales Price',
    show=False).add_to(map_d)
folium.GeoJson(redline_data, name='Redlining Map', style_function=lambda feature: {
    'fillColor': feature['properties']['color'],
    'color': feature['properties']['color'],
    'weight': 0.7,
    'fillOpacity': 0.3,
}).add_to(map_d)
folium.LayerControl().add_to(map_d)
map_d
Below is a statistical analysis of each raw category.
plt.hist(com_data['ratio_vacant'])
plt.title("Histogram of Ratio Vacant")
plt.xlabel("Percentage")
X = sm.add_constant(com_data[['perc_green', 'perc_blue', 'perc_yellow', 'perc_red']])
smmodel_v = sm.OLS(com_data['ratio_vacant'], X)
smfit_v = smmodel_v.fit()
print(smfit_v.summary())
The important result from this test is the probability of the F-statistic, which measures the overall significance of the model. A probability greater than 0.05 means we cannot reject the null hypothesis that the dependent variable, the ratio of vacant lots in this case, is unrelated to the independent variables, the percent of the block group's area covered by each HOLC grade.
In this case, the probability of the F-statistic is 1.19e-12. This number is effectively 0, meaning there is a strong relationship between redlining map coverage and the ratio of vacant lots.
plt.hist(com_data['ratio_foreclosed'])
plt.title("Histogram of Ratio Foreclosed")
plt.xlabel("Percentage")
smmodel_f = sm.OLS(com_data['ratio_foreclosed'], X)
smfit_f = smmodel_f.fit()
print(smfit_f.summary())
Here, the probability of the F-statistic is not less than 0.05, so we cannot reject the null hypothesis of no correlation.
plt.hist(com_data['ratio_sales'])
plt.title("Histogram of Ratio Sales")
plt.xlabel("Percentage")
smmodel_s = sm.OLS(com_data['ratio_sales'], X)
smfit_s = smmodel_s.fit()
print(smfit_s.summary())
plt.hist(com_data['median_sale_price'])
plt.title("Histogram of Median Sales Prices")
plt.xlabel("Dollars")
smmodel_p = sm.OLS(com_data['median_sale_price'], X)
smfit_p = smmodel_p.fit()
print(smfit_p.summary())
Both the ratio of sales and the median sales price had probabilities less than 0.05, so the model suggests that both categories are strongly associated with the redlining map.
Of further note, in the graphs of the data above, all the categories are heavily skewed to the right. In an effort to combat this and improve the accuracy of the models, we will now look at the log of each category.
Below is the same map as above but with each category transformed using log. This has the added benefit of making the variations more distinguishable on the map.
map_dl = folium.Map(location=[39.29, -76.61], zoom_start=11)
folium.Choropleth(
    geo_data=census_geo[['GEOID10', 'geometry']],
    name='Log of Ratio of Vacancies',
    data=census_geo,
    columns=['GEOID10', 'log_ratio_vacant'],
    key_on='feature.properties.GEOID10',
    fill_color='Blues',
    fill_opacity=0.5,
    line_opacity=0.7,
    legend_name='Log of Ratio of Vacancies',
    show=False).add_to(map_dl)
folium.Choropleth(
    geo_data=census_geo[['GEOID10', 'geometry']],
    name='Log of Ratio of Foreclosures',
    data=census_geo,
    columns=['GEOID10', 'log_ratio_foreclosed'],
    key_on='feature.properties.GEOID10',
    fill_color='OrRd',
    fill_opacity=0.7,
    line_opacity=0.7,
    legend_name='Log of Ratio of Foreclosures',
    show=False).add_to(map_dl)
folium.Choropleth(
    geo_data=census_geo[['GEOID10', 'geometry']],
    name='Log of Ratio of Sales',
    data=census_geo,
    columns=['GEOID10', 'log_ratio_sales'],
    key_on='feature.properties.GEOID10',
    fill_color='Purples',
    fill_opacity=0.7,
    line_opacity=0.7,
    legend_name='Log of Ratio of Sales',
    show=False).add_to(map_dl)
folium.Choropleth(
    geo_data=census_geo[['GEOID10', 'geometry']],
    name='Log of Median Sales Price',
    data=census_geo,
    columns=['GEOID10', 'log_median_sale_price'],
    key_on='feature.properties.GEOID10',
    fill_color='BuGn',
    fill_opacity=0.5,
    line_opacity=0.7,
    legend_name='Log of Median Sales Price',
    show=False).add_to(map_dl)
folium.GeoJson(redline_data, name='Redlining Map', style_function=lambda feature: {
    'fillColor': feature['properties']['color'],
    'color': feature['properties']['color'],
    'weight': 0.7,
    'fillOpacity': 0.3,
}).add_to(map_dl)
folium.LayerControl().add_to(map_dl)
map_dl
plt.hist(com_data['log_ratio_vacant'])
plt.title("Histogram of the Log of Ratio Vacant")
plt.xlabel("Log of Percentage")
smmodel_vl = sm.OLS(com_data['log_ratio_vacant'], X)
smfit_vl = smmodel_vl.fit()
print(smfit_vl.summary())
While the probability of the F-statistic went down, we can see in the histogram that the data being modeled is far more normal (in the statistical sense), which makes the model more trustworthy. Of note, this category is bimodal, with modes around 0 and 1.75.
plt.hist(com_data['log_ratio_foreclosed'])
plt.title("Histogram of the Log of Ratio Foreclosed")
plt.xlabel("Log of Percentage")
smmodel_fl = sm.OLS(com_data['log_ratio_foreclosed'], X)
smfit_fl = smmodel_fl.fit()
print(smfit_fl.summary())
Here we can see a big improvement with the transformed data. Before, the model was not statistically significant; after the transformation, we get a probability on the order of 10e-9, which is significant.
plt.hist(com_data['log_ratio_sales'])
plt.title("Histogram of the Log of Ratio Sales")
plt.xlabel("Log of Percentage")
smmodel_sl = sm.OLS(com_data['log_ratio_sales'], X)
smfit_sl = smmodel_sl.fit()
print(smfit_sl.summary())
Here we see another improvement: the model goes from significant to even more significant.
plt.hist(com_data['log_median_sale_price'])
plt.title("Histogram of the Log of the Median Sales Price")
plt.xlabel("Log of Dollars")
smmodel_pl = sm.OLS(com_data['log_median_sale_price'], X)
smfit_pl = smmodel_pl.fit()
print(smfit_pl.summary())
This category behaved similarly to the ratio vacant under the transformation: the data is more normal and the model produced a lower probability. However, looking at the graph of the data, there is a cluster of values at 0. I believe these come from block groups with no recorded median sales price (either because there were no residential sales or because the data was not recorded). Thus, I reran the analysis below with those zeros removed.
median_zero_count = 0
for index, row in com_data.iterrows():
    if row['median_sale_price'] == 0:
        median_zero_count += 1
print(median_zero_count)
There are 22 districts that have a listed median sale price of 0.
To preserve the original data, I make a copy of the DataFrame for this analysis. Then I replace each 0 in the median sales price with NaN and remove all rows containing a NaN.
data_price = com_data.copy()
data_price['median_sale_price'] = data_price['median_sale_price'].apply(lambda x: x if (x>0) else np.nan)
data_price = data_price.dropna()
price_geo = gpd.GeoDataFrame(data_price, geometry=data_price['geometry'])
price_geo.crs = "EPSG:4269"
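The zero-to-NaN step above can equivalently be written with `Series.replace`; a sketch on dummy prices:

```python
import numpy as np
import pandas as pd

# Hypothetical median sale prices, with 0 marking missing data.
prices = pd.Series([120000, 0, 85000, 0])
cleaned = prices.replace(0, np.nan).dropna()
print(cleaned.tolist())  # [120000.0, 85000.0]
```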
Once again, the map of Baltimore City with the redlining map overlay, now with the raw and log-transformed median sales price data.
map_dp = folium.Map(location=[39.29, -76.61], zoom_start=11)
folium.Choropleth(
    geo_data=price_geo[['GEOID10', 'geometry']],
    name='Median Sales Price',
    data=price_geo,
    columns=['GEOID10', 'median_sale_price'],
    key_on='feature.properties.GEOID10',
    fill_color='BuGn',
    fill_opacity=0.5,
    line_opacity=0.7,
    legend_name='Median Sales Price',
    show=False).add_to(map_dp)
folium.Choropleth(
    geo_data=price_geo[['GEOID10', 'geometry']],
    name='Log of Median Sales Price',
    data=price_geo,
    columns=['GEOID10', 'log_median_sale_price'],
    key_on='feature.properties.GEOID10',
    fill_color='BuGn',
    fill_opacity=0.5,
    line_opacity=0.7,
    legend_name='Log of Median Sales Price',
    show=False).add_to(map_dp)
folium.GeoJson(redline_data, name='Redlining Map', style_function=lambda feature: {
    'fillColor': feature['properties']['color'],
    'color': feature['properties']['color'],
    'weight': 0.7,
    'fillOpacity': 0.3,
}).add_to(map_dp)
folium.LayerControl().add_to(map_dp)
map_dp
plt.hist(data_price['median_sale_price'])
plt.title("Histogram of the Cleaned Median Sales Price")
plt.xlabel("Dollars")
Xp = sm.add_constant(data_price[['perc_green', 'perc_blue', 'perc_yellow', 'perc_red']])
smmodel_pc = sm.OLS(data_price['median_sale_price'], Xp)
smfit_pc = smmodel_pc.fit()
print(smfit_pc.summary())
plt.hist(data_price['log_median_sale_price'])
plt.title("Histogram of the Log of the Cleaned Median Sales Price")
plt.xlabel("Log of Dollars")
smmodel_pcl = sm.OLS(data_price['log_median_sale_price'], Xp)
smfit_pcl = smmodel_pcl.fit()
print(smfit_pcl.summary())
The removal of the zeros greatly increased the accuracy of the model, resulting in a far more significant result.
While it was exciting to get highly significant models for each category, it is evidence of a darker story. The HOLC was terminated in 1953, yet its legacy lives on. Despite the nearly 70 years that have passed since then, and almost a hundred since the making of the HOLC maps, the changes have been small enough that the grading done at the time remains an incredibly good predictor of current housing conditions in those regions. However, all hope is not lost. Looking at the coefficients of each model, the greatest negative indicator is not the grade "D" regions but the yellow, grade "C" regions. Upon further analysis of the maps, there are a number of regions that were red on the redlining map but are not in the bottom quarter of areas on any of the statistics. A further study would be needed on the processes that changed these regions. That is not to say there is no work left to be done: the red and yellow regions still tend to be worse off than the green and blue regions. This analysis also does not cover the movement of racial groups over the intervening 85-odd years; I would not count it as an improvement if the housing conditions improved by forcibly relocating the original inhabitants.